MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Construction of a facsimile data set for large genome sequence analysis

Internal identifier: 004A40 (Main/Exploration); previous: 004A39; next: 004A41

Authors: Oliver Seely Jr. [United States]; Da-Fei Feng [United States]; Douglas W. Smith [United States]; Daniel Sulzbach [United States]; Russell F. Doolittle [United States]

Source: Genomics [0888-7543]; 1990; 8(1): 71-82

RBID : ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6

English descriptors

Abstract

Abstract: A test was devised for exploring the question of whether it will be possible to identify genes in large-scale genome studies solely by sequence comparison with current sequence collections. To this end, a facsimile data set was constructed by dividing GenBank Release 56 randomly into two halves, one to serve as a reference set and the other intended to simulate raw data anticipated from large genome sequence projects. All supplementary information and identifying marks were removed from the test set after assignment of random identification numbers to each entry and their encryption. Because noncoding intervening sequences (introns) are underrepresented in GenBank, a program that introduced (simulated) introns into mRNA and prokaryotic sequences was devised. In a further attempt to make the problem of identification more realistic, random base substitutions and single-base deletions were also incorporated. The randomly ordered entries were concatenated, along with random intergenic flanking sequences, into a single long “chromosome” 33 Mb in length and then cut into “cosmids” 50–100 kb long. The chopping process was conducted in such a way that terminal overlaps would allow the order of the entries in the chromosome to be reconstituted. Finally, the sequences of a substantial fraction of the cosmids were converted to their complements. Preliminary searching of 10 test cosmids revealed that more than two-thirds of the entries in the test set should be readily identifiable by type of gene product solely on the basis of comparison with the reference set. These preliminary results suggest that existing computer regimens and sequence collections would be able to identify the majority of eukaryotic genes in any new raw data set, the existence of introns notwithstanding. Moreover, the analysis can be conducted in pace with the data collection so that the search results and summary identifications will be instantly available to the research community at large.
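
The construction steps named in the abstract (random split, simulated introns, base substitutions and deletions, concatenation with random spacers, overlapping cuts, strand flipping) can be illustrated with a small Python sketch. This is not the authors' program: every function name, rate, and length below is a hypothetical choice, the GT...AG intron boundaries are an assumed simplification of splice-junction consensus, and "converted to their complements" is interpreted here as taking the reverse complement.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def random_dna(rng, length):
    """Uniformly random bases, used for spacers and intron bodies."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def split_halves(entries, rng):
    """Randomly divide entries into a reference half and a test half."""
    shuffled = entries[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def insert_intron(seq, rng, intron_len=200):
    """Insert one simulated intron (assumed GT...AG ends) at a random position."""
    if len(seq) < 2 or intron_len < 4:
        return seq
    pos = rng.randrange(1, len(seq))
    intron = "GT" + random_dna(rng, intron_len - 4) + "AG"
    return seq[:pos] + intron + seq[pos:]

def add_noise(seq, rng, sub_rate=0.01, del_rate=0.005):
    """Apply random base substitutions and single-base deletions (assumed rates)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # single-base deletion
        if r < del_rate + sub_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def build_chromosome(entries, rng, spacer_len=500):
    """Concatenate entries in random order, separated by random spacers."""
    order = entries[:]
    rng.shuffle(order)
    parts = [random_dna(rng, spacer_len)]
    for entry in order:
        parts.append(entry)
        parts.append(random_dna(rng, spacer_len))
    return "".join(parts)

def chop(chromosome, rng, min_len, max_len, overlap, flip_prob=0.3):
    """Cut into pieces whose ends overlap; flip a fraction to the other strand.

    Returns (start, flipped, sequence) tuples; start and flipped are kept
    only so the sketch can be checked, a blind test set would drop them.
    """
    if overlap >= min_len:
        raise ValueError("overlap must be smaller than min_len")
    cosmids, start = [], 0
    while True:
        end = min(start + rng.randint(min_len, max_len), len(chromosome))
        piece = chromosome[start:end]
        flipped = rng.random() < flip_prob
        cosmids.append((start, flipped, reverse_complement(piece) if flipped else piece))
        if end == len(chromosome):
            return cosmids
        start = end - overlap  # next piece re-reads the last `overlap` bases

if __name__ == "__main__":
    rng = random.Random(56)
    entries = [random_dna(rng, 300) for _ in range(20)]  # toy stand-ins for entries
    reference, test = split_halves(entries, rng)
    test = [add_noise(insert_intron(e, rng), rng) for e in test]
    chromosome = build_chromosome(test, rng)
    cosmids = chop(chromosome, rng, min_len=1000, max_len=2000, overlap=100)
    print(len(chromosome), len(cosmids))
```

The toy sizes are scaled down by several orders of magnitude from the 33 Mb chromosome and 50–100 kb cosmids reported in the abstract; the shared terminal bases between consecutive pieces are what would let the original order be reconstituted.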

Url: https://api.istex.fr/ark:/67375/6H6-H8LNX0WC-9/fulltext.pdf
DOI: 10.1016/0888-7543(90)90227-L


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Construction of a facsimile data set for large genome sequence analysis</title>
<author>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
</author>
<author>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
</author>
<author>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
</author>
<author>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
</author>
<author>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6</idno>
<date when="1990" year="1990">1990</date>
<idno type="doi">10.1016/0888-7543(90)90227-L</idno>
<idno type="url">https://api.istex.fr/ark:/67375/6H6-H8LNX0WC-9/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">002073</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">002073</idno>
<idno type="wicri:Area/Istex/Curation">002073</idno>
<idno type="wicri:Area/Istex/Checkpoint">001F65</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">001F65</idno>
<idno type="wicri:doubleKey">0888-7543:1990:Seely Jr O:construction:of:a</idno>
<idno type="wicri:Area/Main/Merge">004B18</idno>
<idno type="wicri:Area/Main/Curation">004A40</idno>
<idno type="wicri:Area/Main/Exploration">004A40</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Construction of a facsimile data set for large genome sequence analysis</title>
<author>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>1 Permanent address: Department of Chemistry, California State University, Dominguez Hills, Carson</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<affiliation></affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Genomics</title>
<title level="j" type="abbrev">YGENO</title>
<idno type="ISSN">0888-7543</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="1990">1990</date>
<biblScope unit="volume">8</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="71">71</biblScope>
<biblScope unit="page" to="82">82</biblScope>
</imprint>
<idno type="ISSN">0888-7543</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0888-7543</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Academic press</term>
<term>Amino acid sequences</term>
<term>Amino acids</term>
<term>Artificial introns</term>
<term>Base pairs</term>
<term>Base substitutions</term>
<term>Chicken collagen</term>
<term>Coding sequence</term>
<term>Coding sequences</term>
<term>Codon</term>
<term>Consensus sequence</term>
<term>Cosmid</term>
<term>Cray supercomputer</term>
<term>Current sequence collections</term>
<term>Current status</term>
<term>Diego supercomputer center</term>
<term>Direct comparison</term>
<term>Entry sequence</term>
<term>Exon</term>
<term>Exon length</term>
<term>Facsimile</term>
<term>Facsimile data</term>
<term>First step</term>
<term>Friezner degen</term>
<term>Genbank</term>
<term>Genbank documentation</term>
<term>Genbank entry</term>
<term>Genbank locus</term>
<term>Genbank release</term>
<term>Gene</term>
<term>Gene duplication</term>
<term>Gene products</term>
<term>General identification</term>
<term>Genome</term>
<term>Human genome</term>
<term>Human genome initiative</term>
<term>Intergenic sequences</term>
<term>Intron</term>
<term>Large amounts</term>
<term>Large genome</term>
<term>Last nucleotide positions</term>
<term>Lookup table</term>
<term>Macintosh</term>
<term>Macintosh computer</term>
<term>Mrna</term>
<term>Nucleic acids</term>
<term>Nucleotide</term>
<term>Peptide</term>
<term>Peptide sequences</term>
<term>Program gbprot</term>
<term>Prokaryotic</term>
<term>Prokaryotic sequences</term>
<term>Random intergenic sequences</term>
<term>Random sequences</term>
<term>Reading frame</term>
<term>Reading frames</term>
<term>Reasonable facsimile</term>
<term>Refset</term>
<term>Refset half</term>
<term>Sample cosmid</term>
<term>Search results</term>
<term>Seely</term>
<term>Sequence</term>
<term>Sequence data</term>
<term>Sequencing</term>
<term>Similar segments</term>
<term>Splice junctions</term>
<term>Supplementary information</term>
<term>Terminator codons</term>
<term>Testset</term>
<term>Unannotated genbank entries</term>
<term>Unknown sequence</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: A test was devised for exploring the question of whether it will be possible to identify genes in large-scale genome studies solely by sequence comparison with current sequence collections. To this end, a facsimile data set was constructed by dividing GenBank Release 56 randomly into two halves, one to serve as a reference set and the other intended to simulate raw data anticipated from large genome sequence projects. All supplementary information and identifying marks were removed from the test set after assignment of random identification numbers to each entry and their encryption. Because noncoding intervening sequences (introns) are underrepresented in GenBank, a program that introduced (simulated) introns into mRNA and prokaryotic sequences was devised. In a further attempt to make the problem of identification more realistic, random base substitutions and single-base deletions were also incorporated. The randomly ordered entries were concatenated, along with random intergenic flanking sequences, into a single long “chromosome” 33 Mb in length and then cut into “cosmids” 50–100 kb long. The chopping process was conducted in such a way that terminal overlaps would allow the order of the entries in the chromosome to be reconstituted. Finally, the sequences of a substantial fraction of the cosmids were converted to their complements. Preliminary searching of 10 test cosmids revealed that more than two-thirds of the entries in the test set should be readily identifiable by type of gene product solely on the basis of comparison with the reference set. These preliminary results suggest that existing computer regimens and sequence collections would be able to identify the majority of eukaryotic genes in any new raw data set, the existence of introns notwithstanding. Moreover, the analysis can be conducted in pace with the data collection so that the search results and summary identifications will be instantly available to the research community at large.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Californie</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Californie">
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
</region>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
</country>
</tree>
</affiliations>
</record>
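
The record above uses the `wicri:` prefix without declaring a namespace for it, so a strict XML parser will likely reject it with an unbound-prefix error. A minimal Python sketch of one workaround follows: it binds the prefix to a placeholder URI before parsing and then pulls out the title, authors, and DOI. The URI `urn:example:wicri`, the function name `parse_record`, and the abbreviated `SAMPLE` record are assumptions made for illustration, not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption; the record does not declare one).
WICRI_NS = "urn:example:wicri"

def parse_record(xml_text):
    """Extract title, authors, and DOI from a record shaped like the one above."""
    # Bind the otherwise undeclared wicri: prefix on the root element.
    patched = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    root = ET.fromstring(patched)
    return {
        "title": root.findtext(".//titleStmt/title"),
        "authors": [n.text for n in root.findall(".//titleStmt/author/name")],
        "doi": root.findtext(".//publicationStmt/idno[@type='doi']"),
    }

# Abbreviated copy of the record, kept short for the example.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Construction of a facsimile data set for large genome sequence analysis</title>
<author><name sortKey="Seely Jr, Oliver">Oliver Seely Jr.</name></author>
<author><name sortKey="Feng, Da Fei">Da-Fei Feng</name></author>
</titleStmt>
<publicationStmt>
<idno type="RBID">ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6</idno>
<idno type="doi">10.1016/0888-7543(90)90227-L</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

if __name__ == "__main__":
    info = parse_record(SAMPLE)
    print(info["title"])
    print("; ".join(info["authors"]))
    print(info["doi"])
```

Patching the root tag as a string is a shortcut that only works because the record starts with a bare `<record>` element; a tool that knows the real namespace URI should declare that instead.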

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 004A40 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 004A40 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6
   |texte=   Construction of a facsimile data set for large genome sequence analysis
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021